Capturing the Expert: Generating Fast Matrix-Multiply Kernels with Spiral
Authors
Abstract
Matrix-Matrix Multiplication (MMM) is a fundamental operation in scientific computing. Achieving the floating-point peak with this operation requires expert knowledge of linear algebra and computer architecture to craft a tuned implementation for a given microarchitecture. The expert follows a mechanical process for implementing MMM: first deriving the algorithm models from the literature, then applying by hand the optimizations that are well suited to that architecture, and lastly coding the implementation at the assembly level. In this paper, we argue that this process is mechanical and can be captured in an autotuning and program generation system such as Spiral. We then show that, given this machinery, Spiral can produce code for large-size MMM implementations that is competitive with hand-tuned code.
Similar resources
A Fast GEMM Implementation On a Cypress GPU
We present benchmark results of optimized dense matrix multiplication kernels for Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show ∼ 2 Tflop/s and ∼ 470 Gflop/s, respectively. These results for SP and DP correspond to 73% and 87% of the theoretical performance of the GPU, respectively. Cu...
Fast Radix 2, 3, 4, and 5 Kernels for Fast Fourier Transformations on Computers with Overlapping Multiply-Add Instructions
We present a new formulation of fast Fourier transformation (FFT) kernels for radix 2, 3, 4, and 5, which have a perfect balance of multiplies and adds. These kernels give higher performance on machines that have a single multiply–add (mult–add) instruction. We demonstrate the superiority of this new kernel on IBM and SGI workstations. Key word. FFT kernels AMS subject classifications. 65-04, 4...
Operator Language: A Program Generation Framework for Fast Kernels
We present the Operator Language (OL), a framework to automatically generate fast numerical kernels. OL provides the structure to extend the program generation system Spiral beyond the transform domain. Using OL, we show how to automatically generate library functionality for the fast Fourier transform and multiple non-transform kernels, including matrix-matrix multiplication, synthetic apertur...
Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW Position
Achieving peak performance in important numerical kernels such as dense matrix multiply or sparse-matrix vector multiplication usually requires extensive, machine-dependent tuning by hand. In response, a number of automatic tuning systems have been developed which typically operate by (1) generating multiple implementations of a kernel, and (2) empirically selecting an optimal implementation. One ...
Publication date: 2014